Add coordinator module and corresponding unit test by NumberWan · Pull Request #1465 · vllm-project/vllm-omni

NumberWan · 2026-02-25T04:26:10Z

Part of #984 - adds OmniCoordinator module

Summary

Add OmniCoordinator module for distributed stage instance coordination.

ADD

OmniCoordinator: receives instance events via ZMQ ROUTER, publishes via PUB
OmniCoordClientForStage: stage instances send registration/heartbeat
OmniCoordClientForHub: hub side caches instance list
LoadBalancer: RandomBalancer for instance selection
Unit tests for above components

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1dd914a75a

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

vllm_omni/distributed/omni_coordinator/omni_coordinator.py

vllm_omni/distributed/omni_coordinator/omni_coord_client_for_hub.py

… test Signed-off-by: Jeff Wan <wantszkin2003@gmail.com>

Signed-off-by: Jeff Wan <wantszkin2003@gmail.com>

vllm_omni/distributed/omni_coordinator/load_balancer.py

vllm_omni/distributed/omni_coordinator/messages.py

vllm_omni/distributed/omni_coordinator/omni_coordinator.py

tests/distributed/omni_coordinator/test_omni_coordinator.py

lishunyang12

Left a few comments. Main concern is the TOCTOU race in _handle_event — the check-then-act on self._instances without holding the lock can cause duplicate registrations under concurrent events. The rest are smaller cleanups (license headers, type annotations, hardcoded test ports).

…types, dynamic ports Signed-off-by: Jeff Wan <wantszkin2003@gmail.com>

hsliuustc0106

PR #1465 Review: Add coordinator module and corresponding unit test

Overview

This PR implements the OmniCoordinator module for distributed stage instance coordination (Major PR 1 from #984). The implementation closely follows the design document with a few important deviations.

Critical Issues: 4 found

1. Instance Registry Never Removes DOWN Instances (Design Deviation)

File: omni_coordinator.py:271-276

The design document states instances should be "removed from the internal registry" when DOWN, but the implementation only marks them as DOWN and keeps them forever:

def _remove_instance_locked(self, event: InstanceEvent) -> None:
    info.status = StageStatus.DOWN  # Only marks, never removes

Impact: Memory leak, unbounded registry growth over time.

2. ZMQ Context Singleton Causes Test/Production Issues

Files: omni_coordinator.py:51, omni_coord_client_for_stage.py:27, omni_coord_client_for_hub.py:28

Using zmq.Context.instance() shares a global context across all components:

self._ctx = zmq.Context.instance()  # Global singleton

Impact: Tests interfere with each other, closing one coordinator can break others, cannot run multiple coordinators.

Recommendation: Use per-coordinator context: self._ctx = zmq.Context()

3. No Connection State Machine / Reconnection Logic

Files: omni_coord_client_for_stage.py, omni_coord_client_for_hub.py

The design mentions reliability with retry mechanisms, but there's no reconnection logic if the ZMQ connection drops. Network blips will cause silent failures with no recovery path.

Recommendation: Add connection state machine with exponential backoff reconnection.

4. Scalability Limit Not Documented

The single-threaded coordinator has inherent scalability limits:

Instances	Messages/sec	Status
< 500	< 150	✅ OK
500-2000	150-600	⚠️ Caution
> 2000	> 600	🔴 Bottleneck risk

Recommendation: Add max_instances parameter or document expected limits.

Strengths

✅ Good test coverage matching design spec
✅ Message schemas match design document exactly
✅ Clean separation of concerns (messages, coordinator, clients)
✅ Thread-safe with proper locking
✅ ZMQ best practices (RCVTIMEO, NOBLOCK)

Recommendation

Address issues #1 and #2 before merge. Issue #3 (reconnection) could be documented as Phase 2 work if needed.

vllm_omni/distributed/omni_coordinator/omni_coordinator.py

vllm_omni/distributed/omni_coordinator/omni_coord_client_for_stage.py

Signed-off-by: Jeff Wan <wantszkin2003@gmail.com>

…ontexts Signed-off-by: Jeff Wan <wantszkin2003@gmail.com>

… when disconnected Signed-off-by: Jeff Wan <wantszkin2003@gmail.com>

lishunyang12

All threads resolved LGTM

lishunyang12 · 2026-03-04T14:06:57Z

All my previous comments are addressed — thanks for the quick fixes. I'll defer to @hsliuustc0106 on the remaining items from their review (instance removal, reconnection logic).

NumberWan requested a review from hsliuustc0106 as a code owner February 25, 2026 04:26

chatgpt-codex-connector bot reviewed Feb 25, 2026

View reviewed changes

vllm_omni/distributed/omni_coordinator/omni_coordinator.py Outdated Show resolved Hide resolved

vllm_omni/distributed/omni_coordinator/omni_coordinator.py Outdated Show resolved Hide resolved

vllm_omni/distributed/omni_coordinator/omni_coord_client_for_hub.py Outdated Show resolved Hide resolved

NumberWan force-pushed the omni-coordinator branch from 8526123 to 3a85b30 Compare February 25, 2026 06:41

feat(omni_coordinator): add coordinator module and corresponding unit…

cda6939

… test Signed-off-by: Jeff Wan <wantszkin2003@gmail.com>

NumberWan force-pushed the omni-coordinator branch from f6c4544 to cda6939 Compare February 25, 2026 07:57

NumberWan added 2 commits February 26, 2026 14:33

edited messages

026f35a

Signed-off-by: Jeff Wan <wantszkin2003@gmail.com>

edited messages

0381d5b

Signed-off-by: Jeff Wan <wantszkin2003@gmail.com>

NumberWan mentioned this pull request Feb 27, 2026

[RFC]: Omni Coordinator - OmniCoordinator & Load Balancer JiusiServe/vllm-omni#87

Open